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Abstract 

Binomial data with unknown sizes often appear in biological and medical sciences and are usually overdispersed. 
All previous methods used parametric models and only considered overdispersion due to the variation of sizes. The 
proposed semiparametric model considers overdispersion due to the variation of sizes and that of probabilities. By 
doing this, it can include variations caused by observations, missing covariates, and random measurement errors in 
covariates. An Expectation Conditional Maximization algorithm is provided to stabilize the loglikelihood optimiza- 
tion. Selecting the number of support points of the mixing distributions and the bootstrap methods are also discussed. 
Simulation is done to evaluate the performance of the proposed model. Two real examples are used to illustrate the 
proposed model. 
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1 Introduction 

The study about the binomial data with unknown sizes can be dated back to Wadley (1949), which made a fictitious 
experimental data set. In the experiment, fruits were infested with fruit-fly larvae and exposed to low temperature with 
varying days. Some of the larvae would die from low temperature. The number of fruit-flies that were seen to emerge 
was counted, but the initial larvae number was not known. 

A lot of real binomial data with unknown sizes appeared in the literature. For example, Morton (1981) presented 
a data set about the disinfestation of wheat. Margolin et al. (1981) studied the effects of quinoline on the number 
of revertant colonies of Salmonella strain TA98. Elder (1996) investigated the relationship between the survival of 
V79-473 cells and their times in the high heat. Bailer and Piegorsch (2000) took the effect of nitrofen on the offspring 
of C. dubia as an example. 

Let yi denote a binomial random variable with size rii unknown and probability p L , and Xi denote a vector of 
covariates of length g, i = 1,2, ... ,r. The issue of interest is to investigate the relationship between the covariates Xi 
and the probabilities pi. If overdispersion exists, there will be three cases: overdispersion due to only the variability 
of pi, due to only the variability of rii, or due to both of them (Elder et al. 1999). 

The binomial data with unknown sizes are usually approximated by Poisson distributions (e.g. Wadley 1949 and 
Margolin et al. 1981). However, such an approximation is not reasonable for moderate sizes or probabilities (e.g. 
Elder et al. 1999). Anscombe (1949) considered overdispersion due to the n% and provided a parametric model 
based on the negative binomial distribution. Baker et al. (1980) treated yi in the control group as a Poisson random 
variable with mean m and that in the treatment group with mean mpi, where a probit dose-response relationship is 
assumed. Trajstman (1989) modified the method of Baker et al. (1980) to allow a logistic dose-response relation and 
incorporated overdispersion by assuming a scaled Poisson variance-mean relationship. Based on Baker et al. (1980), 
Morgan and Smith (1992) used a full negative- binomial distribution and incorporated extra Poisson variation. Kim 
and Taylor (1994) and Elder et al. (1999) developed a quasi-likelihood approach by regarding j/j as a binomial random 
variable. Kim and Taylor (1994) assumed that E(rii) = rrii and var (n,) = m^v with rrii known and v ^ 1 unknown. 
Elder et al. (1999) estimated m = E(rii) with var (rii) — m(l + vm) and v ^ 0. All previous methods used 
parametric models and only considered overdispersion due to the variation of rii. 
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We propose a semiparametric model to incorporate overdispersion due to the variability of rii and p^ The rii are 
assumed to be Poisson distributed with means from an unspecified mixing distribution. With the mean of rii being a 
random variable, overdispersion due to the variation of rii is taken into account. It is assumed that pi = p(r) + x^/3), 
where the 77 are further assumed to follow another unspecified mixing distribution. With 77 being a random variable, we 
include variations caused by observations, missing covariates, and random measurement errors in covariates (Follmann 
and Lambert 1989). 

Section 2 is the method part, and includes the proposed model, the Expectation Conditional Maximization (ECM) 
algorithm, how to select the number of support points, and the bootstrap methods. A simulation study is presented in 
Section 3. Two real examples of bioassay are investigated in Section 4. 



2 Methods 
2.1 A model 

In the proposed model, a logistic link is assumed, and its inverse is 

exp(?i + x' t (3) 



Pi = p(v + x'iP) 



1 + cx P (t ? + x[(3) ' 



The unknown size rii is assumed to be a Poisson random variable with mean £j. It can be easily shown that rj ~ 
Pois(£ip(?7 + x^/3)). The nuisance parameters £j and r\ are further assumed to follow mixing distributions G and H, 
respectively. Since the parameter of interest (3 is in an Euclidean space of dimension g, a semiparametric model arises 
when G and H are nonparametric. The density of a single observation (y, x) is 

f(y;x,(3,G,H) = J J f(y;x, (3, A, a)dG(X)dH(a), 

where f(y; x, (3, A, a) is a Poisson density with mean Xp(a + x'f3), i.e., 

f(y; x, (3, A, a) = cxp{-\p(a + x'/3)}{Xp(a + x'/3)} y /y\, y = 0,l,.... 
The log likelihood can be written as 

r 

e(l3,G,H) =^]o S f(y i ;x i ,0,G,H). (1) 



2.2 An ECM algorithm 

Since any distribution can be approximated by a discrete distribution, we assume G and H are discrete distributions. 
First, we consider that G and H have a fixed number of support points K\ and K2, which will be allowed to change 

in Section 2.3. Let G = J2j=i Pj S (^j) and H = Z) m =i where Z)j=i Pj = Lmli n m = L Pj > 0, 

TT m > 0, S is the indicator function, Xj G (0, 00), and a m G 1Z. Let p = (pi,p%, . . . , PKi)'< A = (Ai, A2, . . . , A^)', 
7r = (7Ti, 7r 2 , . . . , 7Tk- 3 )', a = <X2, . . . , a_ftr 2 )', and 8 —{(3, p, A, 7r, a). The log likelihood becomes 

H X! Pjir m f{yi;Xi,/3, X j: a m ) > . (2) 

^ — 1 m— 1 J 

Since direct maximization of ^(0) in (0 is extremely difficult, an EM algorithm may be considered. However, the 
M-step in the EM algorithm may be computationally unreliable because of so many parameters. 

The ECM algorithm (Meng and Rubin 1993, McLachlan and Peel 2000, pl48) is promising. The ECM algorithm 
simplifies the M-step by replacing the complicated M-step with five computationally simpler and stabler CM-steps. 
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But it also keeps the advantage of the EM algorithm, and increases the log likelihood in each iteration (Meng and 
Rubin 1993). 

Suppose the missing data are the indicator vectors for the pair (x,y), z\ = (zu, z\2, z\k^)' and z 2 = 
(^21,^22, ■ Z2K 2 )'< where z\j — 1 means that £ = Aj and Z2 m = 1 means that 77 = a m . Note that z± and Z2 
are independently multinomial distributed with size one and probability p and 7r, respectively. The complete density 
for a single datum (x, y, Zi,z 2 ) is 

nf^n^Li [Pjnmfiy; x,p, \ h a m T^ m . 

The joint complete log likelihood is 

r K\ K 2 

Zc{0) = EE E Z ^ z *2m{ logpj + log7r m + \og[f (y l ; x t , (3 , \j,a m )]}. 

i—1 j — 1 m— 1 

The expected conditional complete log likelihood to be maximized is 

w(o-e w ) = e 0(o) {e c (e)\ yi ,y2,...,y r }. 

In the E-step, the conditional expectation of ZnjZi2 m is calculated, i.e., for i = 1,2, ... ,r,j = 1,2, ... ,K\,m = 
1,2,. ..,K 2 , 

(0) „ , , , p^Wftoxi^^Ka®) 



e \jL = E e (o)(z llj z l2m \yi,y2, -,y r ) 



t k 2 (0) (0) n . fl (0) x(0) (0)x- 

In the CM-step, the expected conditional complete log likelihood 



r Kx K 2 r K x K 2 

* (0) ) = E E E e £ ft+EEE e £ log 

i— 1 j — 1 m— 1 z— 1 j — 1 m— 1 

r -R"! K 2 

EE E e «m kg A Aj-,a m ) 

i—1 j — 1 m— 1 

r ifi if 2 r A"i X 2 

= constant + E E E e §™ log ft + EEE e S™ log ^ + 

i—1 j — 1 rn—1 i—1 j — 1 rn — 1 

Ti(p) T 2 (7r) 

r ATi K 2 

E E E 6 y°™ ^ l0g A J + Vl l °SP( a m + X[f3) - Xjp{a m + X'i0)} 
i—1 j — 1 m—1 

T 3 (A,a,/3) 

is maximized over p, n, A, a and /3 sequentially. Because p, n and (A, oc, (3) are in T\(p), T2(n) and Ts(A, a, (3) 
separately, their maximum likelihood estimators (MLEs) can be found individually. The MLE for p is 

r K 2 

pS^^EEi^ 1 - 2 ^ < 3 > 

i=l m—1 

The MLE for tt is 

7r«=r- 1 X;2e§3L,m=l ) 2,...,Jf a . (4) 
i=i j=i 
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We will maximize T^A, a, (3) over A, a and (3 sequentially. The conditional MLE for A given a = and 

f3 = /3< 0) is 

^(1) _ 2-ii=l 2-im=l e ijmVi ' — 1 2 K (5) 

J2i=l Em=l e ijmP( a "^ + 21 i/^ ^ 

There are no simple analytic forms for the conditional MLEs of a and (3. The conditional MLE for a. given 
A = A (1) and (3 = /3 (0) is 

r Ki 

a m = argmax^ ^ e>°^ {yi log Xj + y t \ogp{a m + x'j3) - Xjp(a m + x'^)} , (6) 
c* m en i=lj=1 

for m = 1,2, ... , i^2. The conditional MLE for /3 given A = A^ 1 -* and a = is 

(3 {1) = argmaxT 3 (A (1) , a (1) , 0). (7) 

The function optim in R can be used to get the MLEs of a and (3 in equations (|6j and (0. 

2.3 Selecting the number of support points 

The maximized log likelihood £(9) can be increased by increasing the number of support points of G or H. We 
propose to choose the number of support points by minimizing the BIC (e.g., Wang et al. 1996) to obtain a reasonable 
and parsimonious fit to the data, i.e., 

(Ki,K 2 )= argmin {-2t(0) + log(r) [2(Ki + K 2 ) - 2 + g]}. 

(Ki,K 2 )e{l,2,...}2 

Forward model selection is used. For a fixed K\, if the BIC stops to decrease for larger K 2 , the models with 
greater K 2 will not be considered for this K\. The strategy is the same for K 2 fixed and K\ changed. From all the 
models considered, the one with the minimum BIC is chosen. 

2.4 The bootstrap method 

The confidence intervals for the regression coefficients (3 can be got by the bootstrap method. The nonparametric 
bootstrap method may be applied for a random design, in which one can sample the pairs (j/j, Xj). For a fixed design, 
a parametric bootstrap method is recommended. A resample of size r is generated as follows, 

V* ~ f(y,Xi,/3,\i,a t ),i = 1,2, ...,r, 

where Xi and on are random variables drawn from the estimated mixing distributions G and H, respectively, where 

G = y^p j S(X j ) and H = ^2 7r m S(a m ). 

j — 1 m—1 

3 Simulation 

In the simulation, the parameter setting takes a 2 3 design, 

{—2, 3} x {Gi,G 2 } x {Hi,H 2 }, 

13 G H 
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where G x = 0.1(5(100) + 0.8(5(200) + 0.1(5(300), G 2 = 0.55(10) + 0.55(50), H x = 0.3<5(-2)+0.3<5(0.4) + 0.4(5(3), 
and H 2 — 0.25<5(— 2) + 0.755(1.5). There is a single covariate x in the simulation. For each integer x in [—5, 5] , 10 
values of y are drawn independently from a Possion distribution with mean Xp(a + x(3), where A and a are random 
variables drawn from G and H. So the sample size r is 110. 

For each parameter setting, 200 samples are drawn. Table [2 presents the simulation results. The bias, standard 
deviation and mean square error of are small, and each (3 falls into its 95% quantile interval, with 2.5% and 97.5% 
quantiles as endpoints. 

Table 1: Simulation results: sd stands for standard deviation, qi for 95% quantile intervals and mse for mean square 
error. 



setting 


beta 


G 


H 


bias 


sd 




qi 


mse 


1 


-2 


Gi 


Hx 


-0.00 


0.08 


(• 


-2.15, -1.86) 


0.01 


2 


-2 


Gx 


H 2 


-0.01 


0.09 


(- 


-2.17, -1.85) 


0.01 


3 


-2 


G 2 


Hx 


-0.11 


0.32 


(- 


-2.77, -1.62) 


0.12 


4 


-2 


G 2 


H 2 


-0.07 


0.24 


(• 


-2.69, -1.68) 


0.06 


5 


3 


Gx 


Hx 


0.00 


0.16 




(2.72, 3.27) 


0.02 


6 


3 


Gx 


H 2 


0.02 


0.16 




(2.72, 3.33) 


0.02 


7 


3 


G 2 


Hx 


0.17 


0.68 




(2.20, 4.50) 


0.49 


8 


3 


G 2 


H 2 


0.03 


0.45 




(2.36, 4.23) 


0.20 



4 Example 

4.1 M. Bovis data 

Table|2]is part of Table 1 of Trajstman (1989), and also appeared in Morgan and Smith (1992). One of the decontam- 
inants, HPC or oxalic acid, with a specific concentration was applied on a group of M. bovis cells, which were then 
placed on the culture plates for colony formation. After 12 weeks (at stationarity), the number of M. bovis colonies 
were counted, which is equal to the number of surviving M. bovis cells. 

An ANOVA model is fitted with a separate factor for each level of the decontaminants. Let Xj denote a factor for 
the concentration level j of the decontaminants. It is assumed that the pi satisfy that 

11 

1os {t^~} + i= 1,2,..., 129, (8) 

P* j=i 

where 77 is the control effect and j3j is the effect difference between dose j and the control one, j = 1, 2, . . . , 11. 

From Table |3 the case in which G and H with 2 support points has the minimum BIC. Therefore, the MLEs of G 
and H are G = 0.045(12.41) + 0.965(99.32), and H = 0.825(-0.03) + 0.185(0.63), respectively. 

Table0]presents the estimated regression coefficients, their bootstrap standard errors and 95% confidence intervals 
from 200 bootstrap resamples. The MLEs /3@ and (3g violate the monotonic dose-response relationship, i.e., the larger 
dose do not produce stronger effects here. This finding is consistent with the monotonicity violation of their sample 
means in Table [2] Nine out of eleven confidence intervals do not include 0, so the corresponding doses have signifi- 
cantly stronger negative effects than the control one. Two exceptions are those of (3§ and (3xx- An explanation is that 
the Oxalic acid dose 0.005 is too small to take any different effect on the M. Bovis cells from the control one. The 
estimates of j3 in Table 0] can not be compared to those of Trajstman (1989) and Morgan and Smith (1992), because 
they used a simple linear model in (|8jl. Figure^presents the responses y, the sample means y and the fitted values y. 
The model seems fit well. 
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Table 2: The M. bovis cell survival data. 



%weight/volume No. of M. bovis colonies at stationarity sample mean 



control experiment (no decontaminant) 





52 


80 


55 


50 


58 


50 


43 


50 


53 


54 


51.8 




44 


51 


34 


37 


46 


56 


64 


51 


67 


40 




[HPC] 








decontaminant: HPC 










0.75 


2 


4 


8 


9 


10 


1 





5 


14 


7 


6.0 


0.375 


11 


12 


13 


12 


11 


13 


17 


16 


21 


2 


12.8 


0.1875 


16 


6 


20 


23 


23 


39 


18 


23 


33 


21 


22.2 


0.09375 


33 


46 


42 


18 


35 


20 


19 


29 


41 


36 


31.9 


0.075 


30 


30 


27 


53 


51 


39 


31 


36 


38 


22 


35.7 


0.0075 


53 


62 


38 


54 


54 


38 


46 


58 


54 


57 


51.4 


0.00075 


3 


42 


45 


49 


32 


39 


40 


34 


45 


51 


38.0 


[Oxalic acid] 






decontaminant: oxalic acid 








5 


14 


15 


6 


13 


4 


1 


9 


6 


12 


13 


9.3 


0.5 


27 


33 


31 


30 


26 


41 


33 


40 


31 


20 


31.2 


0.05 


33 


26 


32 


24 


30 


52 


28 


28 


26 


22 


30.1 


0.005 


36 


54 


31 


37 


50 


73 


44 


50 


37 




45.8 



o y 

• y 

□ $ 




Figure 1 : The response y, sample mean y and fitted value y. 
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Table 3: The BIC for the estimated mixing distributions. 



K X \K 2 


1 


2 


3 


4 


1 


1061.1 


991.3 


977.6 


985.1 


2 


998.0 


976.9 


978.3 


988.9 


3 


977.0 


978.7 


994.3 




4 


984.2 


988.5 







Table 4: The estimated regression coefficients, bootstrapped standard error, and 95% confidence interval for the M. 
Bovis data. 



dose 





MLE 


se 




95% ci 


HPC 














0.75 




-2.74 


0.55 


(- 


-4.29, - 


-2.36) 


0.375 


ih 


-1.86 


0.51 


(- 


-3.38, - 


-1.55) 


0.1875 


03 


-1.25 


0.48 


(- 


-2.59, - 


-0.96) 


0.09375 


fh 


-0.98 


0.45 


(- 


-2.12, - 


-0.69) 


0.075 


05 


-0.72 


0.42 


(- 


-1.72, - 


-0.48) 


0.0075 


06 


0.04 


0.94 


(- 


-0.25, 


1.38) 


0.00075 


07 


-0.35 


0.38 


(- 


-1.14, - 


-0.07) 


Oxalic acid 














5 


08 


-2.30 


0.53 


(- 


-3.63, - 


-1.96) 


0.5 


09 


-0.84 


0.45 


(- 


-1.99, - 


-0.55) 


0.05 


0w 


-0.90 


0.41 


(- 


-2.11, - 


-0.68) 


0.005 


0ii 


-0.28 


0.38 


(- 


-0.97, 


0.08) 



4.2 Jejunal crypt data 

The jejunal crypt data are referred to Table 1 of Elder et al. (1999), which are also studied by Kim and Taylor (1994). 
There are 126 live mice divided into groups, not all of equal sizes. The treatment consists of exposing each group 
of mice to a certain dose of gamma rays, and then killing them to find out the number of surviving crypts. The total 
number of crypts in each mouse is unknown, because the experiment needs live mice. It is assumed that the surviving 
probabilities pi satisfy that 

log | 1 =r, + Xi , t=l,2,... ,126, 

where Xi is the gamma dose. 

The BIC are 724.2 for K x = K 2 = 1, 733.9 for Ki = 2, K2 = 1 and K x = 1, K 2 = 2. Thus, the estimated G 
is degenerated at a — 6.705, and H at A = 196.1. We draw 200 bootstrap resamples. Table[5]presents the estimation 
results of the proposed model and the previous ones. The bootstrap standard error of (3 is quite small and its 95% 
confidence interval is (—1.225, —1.029). All listed estimates of lie in our confidence interval. Since this interval 
does not cover 0, the j3 in the proposed model is significant at the significance level of 0.05. 



5 Discussion 

We propose a flexible semiparametric model to incorporate overdispersion due to the variation of and Pi. The 
regression coefficients are estimated with the nuisance parameters, the mixing distributions in a seamless fashion. 
Although a logistic dose-response relation is assumed, it can be extended to other links very easily. When one runs 
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Table 5: Jejunal crypt data results from the proposed and previous approaches (logistic regression and Kim's method 
fix Hi and E(rii) at 160, respectively; Kim's and Elder's quasi-likelihood method of moments estimates come from 
Elder etal. (1999)). 





estimate (standard error) 




logistic 


Kim's Elder's 


proposed 


a 


7.432 (0.175) 


7.410(0.191) 6.727(0.725) 


6.705 


(3 


-1.185 (0.024) 


-1.183(0.026) -1.126(0.061) 


- 1.124(0.044) 


A 




_ 194.7 (43.4) 


196.1 



the ECM algorithm, a Poisson regression is suggested to run first to obtain a good initial value of j3. 
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